Inferring Ideology from Incidents:
An Analysis of the Global Terrorism Database
By Tyler A. Clark

Introduction

In 2019, there were nearly 8,500 terrorist attacks around the world, killing more than 20,000 people (Source: Global Terrorism Overview). To mitigate and combat terrorism successfully, it is imperative to understand the complex geopolitical dynamics that enable terrorism and terrorist ideologies. This project aims to analyze what one can infer about terrorist ideologies from data about their attacks. If one can identify causal relationships between characteristics of terrorists, including their ideologies and group structure, and the types of attacks they perpetrate, then perhaps we could create informed policy to predict and mitigate terrorism.

Before proceeding, it is necessary to define key terms. Terrorism is notoriously difficult to define, and there is little agreement on a definition across industry, academia, and government. We will use data from the Global Terrorism Database (GTD) as the basis for our analysis, and will therefore use the definitions of terrorism and terrorist attacks provided in the dataset's codebook:

A terrorist attack is the threatened or actual use of illegal force and violence by a non-state actor to gain political, economic, religious, or social goals through fear, coercion, or intimidation.

(Source: GTD Codebook)

It is worth analyzing this definition to better understand what is and is not terrorism for the purposes of this project. The codebook indicates that terrorist attacks must be intentional. That does not mean that the attack is carried out exactly as intended, but rather that there is an intended target, a method by which to inflict harm, and perhaps evidence of planning. Additionally, a terrorist attack must include violence or the immediate threat of violence, against either people or property. Violence in the codebook means an intention to cause injury and/or irrevocable destruction or kinetic damage. The perpetrators must also be sub-national actors. The database does not include acts of state terrorism, that is, attacks by persons who are employed by the state and/or are acting on behalf of a state or nation. This criterion does not exclude state-sponsored attacks, only attacks perpetrated by state actors.

In addition to meeting the definition above, an incident must meet at least two of the following three criteria to be included in the dataset:

1. The act must be aimed at attaining a political, economic, religious, or social goal. In terms of economic goals, the exclusive pursuit of profit does not satisfy this criterion. It must involve the pursuit of more profound, systemic economic change.
2. There must be evidence of an intention to coerce, intimidate, or convey some other message to a larger audience (or audiences) than the immediate victims. It is the act taken as a totality that is considered, irrespective of whether every individual involved in carrying out the act was aware of this intention. As long as any of the planners or decision-makers behind the attack intended to coerce, intimidate, or publicize, the intentionality criterion is met.
3. The action must be outside the context of legitimate warfare activities. That is, the act must be outside the parameters permitted by international humanitarian law.

For additional explanation of these criteria, as well as examples, please see the GTD Codebook.

Data Collection

Thankfully, the most tedious part of the data science pipeline has been done for us. The researchers at the National Consortium for the Study of Terrorism and Responses to Terrorism (START) have collected data on global terrorism incidents from 1970 through 2019 in their Global Terrorism Database. The database, informed by open-source media articles, contains more than 100 structured variables that characterize each attack’s location, tactics and weapons, targets, perpetrators, casualties and consequences, and general information such as definitional criteria and links between coordinated attacks. Unstructured variables include summary descriptions of the attacks, more detailed information on the weapons used, and the specific motives of the attackers. The GTD is accessible to individuals and organizations from the GTD website.

While the methodology for collecting data has evolved since the inception of the database in 2006, it is worth mentioning the hybrid workflow the GTD employs to collect, process, and publish data today. The process starts with a pool of more than two million open-source media reports published each day. The GTD team combines automated and human workflows, leveraging the strengths and mitigating the limitations of each, to produce rich and reliable data. On the automated side, GTD researchers use Boolean filters, natural language processing (NLP), deduplication of articles, location identification, clustering of similar articles, and machine learning (ML) models to identify relevant articles. After the automated process has gathered, filtered, and labeled articles, a team of analysts triages the articles to assess source validity, apply inclusion criteria, and create narratives of single incidents from multiple sources. The incidents are then coded by smaller teams with specific domain expertise.

Data Processing I

After creating an individual-use account for the GTD, we download the dataset and import it as a dataframe using pandas.

Perhaps the first thing to note is that this dataframe is quite large to be manipulating in a Jupyter Notebook. It contains over 200,000 incidents and 135 columns, and takes up about 100MB of memory. A dataset of this size may not be considered "big data", but it warrants careful consideration of how we analyze the data to avoid long wait times and computational inefficiency. First we are going to "clean" the data by taking a subset of the columns that we will use for our analysis. The resulting dataframe will be easier to read, iterate over, and operate on.
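The column subsetting described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's original cell: the helper names `subset_columns` and `memory_mb` are hypothetical, the tiny `raw` frame is made up to stand in for the real download (which would be read with something like `pd.read_excel` or `pd.read_csv`, depending on the file format), and the two dropped columns are only assumed examples of fields left out.

```python
import pandas as pd

# Columns kept for analysis; names follow the GTD fields described in this section.
KEEP_COLUMNS = [
    "iyear", "imonth", "iday",
    "country", "country_txt", "region", "region_txt",
    "provstate", "city",
    "attacktype1", "attacktype1_txt",
    "targtype1", "targtype1_txt",
    "gname", "motive",
    "weaptype1", "weaptype1_txt",
    "nkill", "nwound",
]

def subset_columns(df, columns=KEEP_COLUMNS):
    """Return a copy of df restricted to the requested columns that exist in it."""
    present = [c for c in columns if c in df.columns]
    return df[present].copy()

def memory_mb(df):
    """Approximate in-memory size of a dataframe in megabytes."""
    return df.memory_usage(deep=True).sum() / 1e6

# Tiny made-up frame standing in for the real download (illustrative values only).
raw = pd.DataFrame({
    "iyear": [1970, 1971],
    "region": [1, 8],
    "region_txt": ["North America", "Western Europe"],
    "nkill": [0.0, 2.0],
    "summary": ["placeholder text", "placeholder text"],      # assumed extra column, dropped
    "scite1": ["placeholder source", "placeholder source"],   # assumed extra column, dropped
})

gtd = subset_columns(raw)
print(list(gtd.columns))
print(f"{memory_mb(raw):.4f} MB -> {memory_mb(gtd):.4f} MB")
```

Passing `deep=True` to `memory_usage` counts the actual size of string columns, which matters here because the free-text fields account for much of the footprint.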

The new dataframe is about one-sixth the size of the original, now 15.2MB. The original dataframe contained columns that described sources, validity, detailed text descriptions, and more. With the exception of two unstructured text columns, we have kept only structured data describing the incidents. We break down the columns and what they record in the table below:

Column Name | Variable Name | Data Type | Description
--- | --- | --- | ---
iyear | Year | interval | This field contains the year in which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the year when the incident was initiated.
imonth | Month | categorical | This field contains the number of the month in which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the month when the incident was initiated.
iday | Day | interval | This field contains the numeric day of the month on which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the day when the incident was initiated.
country, country_txt | Country | categorical | This field identifies the country or location where the incident occurred. Separatist regions, such as Kashmir, Chechnya, South Ossetia, Transnistria, or Republic of Cabinda, are coded as part of the “home” country.
region, region_txt | Region | categorical | This field identifies the region in which the incident occurred. The regions are divided into 12 categories, dependent on the country coded for the case: North America, Central America & Caribbean, South America, East Asia, Southeast Asia, South Asia, Central Asia, Western Europe, Eastern Europe, Middle East & North Africa, Sub-Saharan Africa, and Australasia & Oceania.
provstate | Province/State | text | This variable records the name (at the time of event) of the 1st order subnational administrative region in which the event occurs.
city | City | text | This field contains the name of the city, village, or town in which the incident occurred. If the city, village, or town for an incident is unknown, then this field contains the smallest administrative area below provstate which can be found for the incident (e.g., district).
attacktype1, attacktype1_txt | Attack Type | categorical | This field captures the general method of attack and often reflects the broad class of tactics used. It consists of nine categories: Assassination, Hijacking, Kidnapping, Barricade Incident, Bombing/Explosion, Armed Assault, Unarmed Assault, Facility/Infrastructure Attack, Unknown.
targtype1, targtype1_txt | Target Type | categorical | The target/victim type field captures the general type of target/victim. When a victim is attacked specifically because of his or her relationship to a particular person, such as a prominent figure, the target type reflects that motive. For example, if a family member of a government official is attacked because of his or her relationship to that individual, the type of target is “government.” This variable consists of 22 categories that can be found in the GTD codebook.
gname | Perpetrator Group Name | unstructured text | This field contains the name of the group that carried out the attack. To ensure consistency in the usage of group names, the GTD uses a standardized list of group names that has been established by project staff to serve as a reference for all subsequent entries. In the event that the name of a formal perpetrator group or organization is not reported in source materials, this field may contain relevant information about the generic identity of the perpetrator(s) (e.g., “Protestant Extremists”). Note that these categories do not represent discrete entities. They are not exhaustive or mutually exclusive (e.g., “student radicals” and “left-wing militants” may describe the same people). They also do not characterize the behavior of an entire population or ideological movement. For many attacks, generic identifiers are the only information available about the perpetrators. Because of this they are included in the database to provide context; however, analysis of generic identifiers should be interpreted with caution.
motive | Motive | unstructured text | When reports explicitly mention a specific motive for the attack, this motive is recorded in the “Motive” field. This field may also include general information about the political, social, or economic climate at the time of the attack if considered relevant to the motivation underlying the incident. Note: This field is presently only systematically available with incidents occurring after 1997.
weaptype1, weaptype1_txt | Weapon Type | categorical | This field records the general type of weapon used in the incident. It consists of the following categories: Biological, Chemical, Radiological, Nuclear, Firearms, Explosives, Fake Weapons, Incendiary, Melee, Vehicle, Sabotage Equipment, Other, and Unknown.
nkill | Total Number of Fatalities | ratio | This field stores the number of total confirmed fatalities for the incident. The number includes all victims and attackers who died as a direct result of the incident. Where there is evidence of fatalities, but a figure is not reported or it is too vague to be of use, such as “many” or “some,” this field remains blank.
nwound | Total Number of Injured | ratio | This field records the number of confirmed non-fatal injuries to both perpetrators and victims. It follows the conventions of the “Total Number of Fatalities” field described above.

With the exception of gname and motive, all the variables we have included in the dataframe are structured and well-defined. For additional information on each of the variables and examples of how they are coded, see the GTD codebook. Now that we have our dataframe, we can proceed to some exploratory data analysis.

Exploratory Data Analysis and Visualization I

Let's begin by examining the correlation between the variables in our dataset using a heatmap. Note that we need to one-hot encode our categorical variables (expand each category into its own indicator column) before attempting to identify correlations; otherwise the corr method will treat the numeric category codes as interval variables. That is, it would treat operations such as North America + Southeast Asia = Central Asia as well defined. For more information on types of data and the operations you can perform on them, see: Type of data but not data-types.
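As a rough sketch of that preparation step, the snippet below expands two text categories into indicator columns with `pd.get_dummies` and then computes the Pearson correlation matrix. The rows are made up for illustration, and the choice of columns and prefixes is an assumption rather than the notebook's exact code.

```python
import pandas as pd

# Made-up rows using GTD-style column names (illustrative values only).
gtd = pd.DataFrame({
    "iyear": [1975, 1985, 1995, 2005, 2015, 2016],
    "region_txt": ["Western Europe", "South America", "Western Europe",
                   "South Asia", "South Asia", "South Asia"],
    "attacktype1_txt": ["Bombing/Explosion", "Armed Assault", "Bombing/Explosion",
                        "Armed Assault", "Bombing/Explosion", "Armed Assault"],
    "nkill": [0, 3, 1, 5, 12, 7],
    "nwound": [2, 1, 4, 9, 30, 11],
})

# Expand each text category into 0/1 indicator columns so that no ordering or
# arithmetic is implied between categories.
encoded = pd.get_dummies(
    gtd,
    columns=["region_txt", "attacktype1_txt"],
    prefix=["region", "attack"],
    dtype=float,
)

# Pearson correlation between every pair of columns.
corr = encoded.corr()
print(corr.round(2))
```

The resulting matrix can be handed to a plotting routine (for example a seaborn heatmap) to draw the figure. Indicator columns derived from the same categorical variable are negatively correlated by construction, which is worth keeping in mind when reading the heatmap.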

The plot above is a correlation matrix that shows the correlation coefficient between each pair of variables in the dataset. Unfortunately, few variables in our dataset appear to be highly correlated. While our visualization is fun, it is clunky and hard to interpret. Let's examine further by finding all of the pairs of variables whose correlation coefficient is greater than 0.3 or less than -0.3. It is generally accepted, although arbitrary, that an absolute value of about 0.3 represents a weak correlation, values between 0.3 and 0.7 a moderate correlation, and values of 0.7 or greater a strong correlation. For more information on correlation coefficients, see: Boston University Correlation and Regression.
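One way to pull out the pairs beyond the ±0.3 threshold is sketched below. The function name `correlated_pairs` is hypothetical and the small matrix is made up; the idea is simply to keep the strict upper triangle so that each pair is listed once and self-correlations are ignored.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr, threshold=0.3):
    """List variable pairs whose |r| exceeds threshold, strongest first."""
    # Keep only the strict upper triangle so each pair appears once and the
    # diagonal (every variable with itself) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (the masked-out cells) compare as False and are dropped here.
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Small made-up correlation matrix (illustrative values only).
names = ["iyear", "nkill", "nwound"]
corr = pd.DataFrame(
    [[1.00, 0.05, 0.10],
     [0.05, 1.00, 0.45],
     [0.10, 0.45, 1.00]],
    index=names, columns=names,
)

result = correlated_pairs(corr)
print(result)
```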

We have 17 pairs of variables that are at least weakly correlated. Let's see if any of the relationships are not obvious or artificial. The first three pairs show a negative correlation between year and three regions in the dataset. This likely points to a decrease in terrorism in Central America & Caribbean, South America, and Western Europe from 1970 through 2019. The fourth pair shows a correlation between casualties and injuries, which is intuitive: the more people who are injured in an attack, the more likely there are to be casualties, and vice versa. Pairs six through eight indicate that during incidents classified as armed assaults, firearms are typically used rather than other types of weapons; this is obvious. Similarly, the correlations in pairs 10 and 11 show that for attack types involving a bombing or explosion, there is a very strong correlation with the weapon type being explosives; once again, this is obvious. Pair 12 shows that there is a strong correlation between incidents classified as attacks on infrastructure and the use of incendiary weapons. This likely just means that the number of arson cases in the GTD is far greater than the number of attacks on people using incendiary weapons. Pair 13 indicates that there is a weak correlation between incidents classified as kidnappings and incidents where the weapon type was unknown. Perplexingly, there is a correlation between incidents classified as unarmed assaults and the use of chemical weapons. This perhaps has to do with how incidents in the GTD are coded, but it warrants further investigation. Pair 15 shows that there is a strong correlation between incidents where the weapon was unknown and incidents where the attack classification was unknown. This is likely a reflection of gaps in open-source data. Finally, pairs 5, 16, and 17 are artificial correlations: each pair was derived from the same variable, so the correlations are meaningless.
Now that we have examined our correlated pairs, let's investigate some of the less obvious correlations. Namely, the correlation between year and region; the correlation between incendiary weapons and attacks on infrastructure; and the correlation between unarmed assaults and the use of chemical weapons.

Let's continue by further exploring the relationship between year of a terrorist incident and the region in which the incident occurred. First let's plot some general information about terrorist attacks over time, such as the number of attacks per year and the number of casualties per year.
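The per-year totals behind such a plot can be computed with a single groupby. The sketch below uses a hypothetical helper `yearly_totals` and a handful of made-up incidents; it assumes that "casualties" corresponds to the `nkill` column and that every row is one attack.

```python
import pandas as pd

def yearly_totals(df):
    """Count incidents and sum fatalities for each year."""
    return (
        df.groupby("iyear")
          # "size" counts every row (one per incident), even when nkill is blank;
          # "sum" skips blank fatality counts.
          .agg(attacks=("nkill", "size"), casualties=("nkill", "sum"))
          .reset_index()
    )

# Made-up incidents (illustrative values only); None marks an unreported count.
incidents = pd.DataFrame({
    "iyear": [2013, 2013, 2014, 2014, 2014],
    "nkill": [1.0, None, 4.0, 0.0, 10.0],
})

per_year = yearly_totals(incidents)
print(per_year)
```

The two resulting columns can then be plotted against `iyear`, for instance on twin axes, since attack counts and casualty counts may differ in scale.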

The plot above shows both the number of terrorist attacks per year and the number of casualties from terrorist attacks per year from 1970 through 2019. Immediately we see that there is not a linear relationship between attacks and time, or between casualties and time. Generally, we see periods of increasing and decreasing attacks and casualties, with a notable spike in 2014 followed by a rapid decline. The spike coincides with the formation of the Islamic State of Iraq and Syria (ISIS) in 2013 and its declaration of a caliphate in 2014.

Machine Learning & Hypothesis Testing Part I

There is a strong and intuitive relationship between the number of attacks and the number of casualties. Let's plot casualties per year as a function of attacks per year and see if there is a linear relationship. We hypothesize that if there is a significant linear relationship between attacks per year and casualties per year, then the slope of a best-fit line will not equal zero, i.e. $$H_0: B_1 = 0$$ $$H_1: B_1 \neq 0$$ where $B_1$ is the regression coefficient (or slope). We will choose a significance level of 0.05 for our test. We concede that the choice of 0.05 is relatively arbitrary; however, it is the generally accepted value used when testing hypotheses. A significance level of 0.05 implies that there is a 5% chance we incorrectly reject the null hypothesis when it is in fact true. We can then use a linear t-test to determine whether the slope of the regression line differs significantly from zero (i.e. whether we reject or fail to reject the null hypothesis). We can use the statsmodels library to test our hypothesis.

In the plot above, we see the number of casualties from terrorist attacks per year plotted as a function of terrorist attacks per year. Visually, there is a strong, positive, linear correlation, which is intuitive. The more attacks there are per year, the more casualties there are from attacks in a given year. Let's now test our hypothesis stated above.

Remember that the p-value measures the probability of getting results at least as extreme as the ones you observed, given that the null hypothesis is true. (Source: Not even Scientists Can Easily Explain p-Values | FiveThirtyEight) The p-value is less than 0.05; in fact, it is less than 0.001, which implies that if the null hypothesis were true, there would be less than a 0.1% chance of seeing data at least as correlated as the GTD data actually is. Therefore, we reject the null hypothesis. Furthermore, let's look at the Pearson correlation coefficient between the variables.
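For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The short sketch below computes it directly from that definition on made-up yearly totals; the function name `pearson_r` is hypothetical, and in practice `Series.corr` or `np.corrcoef` gives the same number.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up yearly totals (illustrative values only).
attacks = [100, 200, 300, 400, 500]
casualties = [180, 450, 560, 790, 1040]

r = pearson_r(attacks, casualties)
print(round(r, 3))
```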

A correlation coefficient greater than 0.7 implies that the number of attacks per year and the number of casualties per year are strongly correlated. Let's investigate whether this trend holds when we disaggregate by region. First, let's replicate the plots of attacks per year and casualties per year, broken down by region.

The plots above show terrorist attacks per year and casualties per year, broken down by region. Immediately we see strong correlations between casualties per year and attacks per year, with a few exceptions: North America, Southeast Asia, Western Europe, and East Asia. Again, we will plot casualties per year as a function of attacks per year, now for each region, and test the following hypothesis for each region: if there is a significant linear relationship between attacks per year and casualties per year (in each region), then the slope of a best-fit line will not equal zero, i.e. $$H_0: B_1 = 0$$ $$H_1: B_1 \neq 0$$ where $B_1$ is the regression coefficient (or slope). We will choose a significance level of 0.05 for our test.
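The per-region version of the test can be sketched as a loop over regions. Note that this illustration uses `scipy.stats.linregress` rather than statsmodels; it reports the two-sided p-value for a zero slope, which is the same null hypothesis. The function name `slope_test_by_region`, the region labels, and the incident rows are all made up.

```python
import pandas as pd
from scipy import stats

def slope_test_by_region(df, alpha=0.05):
    """Regress yearly fatalities on yearly attack counts separately per region."""
    yearly = (
        df.groupby(["region_txt", "iyear"])
          .agg(attacks=("nkill", "size"), casualties=("nkill", "sum"))
          .reset_index()
    )
    rows = []
    for region, grp in yearly.groupby("region_txt"):
        fit = stats.linregress(grp["attacks"], grp["casualties"])
        rows.append({
            "region": region,
            "slope": fit.slope,          # estimated casualties per attack
            "p_value": fit.pvalue,       # two-sided test of H_0: slope = 0
            "significant": bool(fit.pvalue < alpha),
        })
    return pd.DataFrame(rows)

# Made-up incidents: (region, year, fatalities for each incident that year).
records = [
    ("Region A", 2000, [1, 1]),
    ("Region A", 2001, [2, 1, 1]),
    ("Region A", 2002, [3, 2, 1, 2]),
    ("Region A", 2003, [2, 2, 3, 2, 2]),
    ("Region B", 2000, [0, 5]),
    ("Region B", 2001, [1, 0, 0]),
    ("Region B", 2002, [0, 6, 0, 0]),
    ("Region B", 2003, [0, 1, 0, 0, 0]),
]
incidents = pd.DataFrame(
    [(region, year, k) for region, year, kills in records for k in kills],
    columns=["region_txt", "iyear", "nkill"],
)

results = slope_test_by_region(incidents)
print(results)
```

Because one test is run per region, a stricter threshold (for example a Bonferroni-style correction) may be worth considering; that adjustment is not applied in this sketch.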

The above plots show the number of casualties per year versus the number of attacks per year, with one plot for each region. Additionally, we provide the results of our hypothesis tests in the table below.

Region | p-value | Statistically Significant? | Casualties/Attack (Slope)
--- | --- | --- | ---
Central America & Caribbean | 0.000 | Yes | 3.41
North America | 0.242 | No | -0.14*
Southeast Asia | 0.000 | Yes | 0.52
Western Europe | 0.000 | Yes | 0.32
East Asia | 0.364 | No | 0.43*
South America | 0.000 | Yes | 1.68
Eastern Europe | 0.000 | Yes | 1.31
Sub-Saharan Africa | 0.000 | Yes | 3.93
Middle East & North Africa | 0.000 | Yes | 2.76
Australasia & Oceania | 0.344 | No | 0.36*
South Asia | 0.000 | Yes | 0.87
Central Asia | 0.000 | Yes | 2.68

A * denotes that the casualties per attack, or slope, for a region is not a meaningful measure because there is no statistically significant linear relationship between the number of casualties per year and the number of attacks per year.

We see that all the regions in the dataset have statistically significant relationships between the number of attacks per year and the number of casualties per year except North America, East Asia, and Australasia & Oceania. Each of these three regions has one outlier year in which casualties far outpaced the number expected given the number of attacks that year. In North America, the outlier year is 2001, the year in which the September 11th attacks took place. In East Asia, the outlier corresponds to attacks in China by Uighur separatists throughout 2014. In Australasia & Oceania, the outlier corresponds to the 2019 Christchurch mosque shootings.

Insights Attained I

Through our hypothesis tests, we have shown that globally there is a statistically significant linear relationship between the number of terrorist attacks in a given year and the number of casualties from terrorist attacks in that year. We show that this relationship remains statistically significant when we disaggregate by region, with the exception of North America, East Asia, and Australasia & Oceania. In all three regions there was one significant outlier, where the number of casualties in a given year far exceeded the number expected under our linear model. On the surface, there do not seem to be any obvious similarities between the three outliers that would help create policy to mitigate the harm done by a single attack. The outlier years in North America and Australasia & Oceania each corresponded to a single incident, while the outlier year in East Asia corresponded to rising violence throughout an entire year. The motivations for the incidents in the three regions were different as well. In the case of the September 11 attacks, Osama bin Laden stated in his 2002 "Letter to America" that al-Qaeda's attacks were motivated by U.S. occupation in the Middle East as well as U.S. support of governments that were in active conflicts against Muslims around the world, such as Israel, Somalia, Philippines, Russia, and India [Source: The Guardian]. In the case of the Christchurch mosque shootings, the perpetrator, Brenton Tarrant, was motivated by white supremacy, xenophobia, and Islamophobia [Source: Time]. In the case of the Xinjiang conflict, Uighur separatists, backed by affiliates of the Islamic State and al-Qaeda, sought to separate from China in order to escape religious persecution. For more information about the conflict, see Devastating Blows: Religious Repression of Uighurs in Xinjiang, The Xinjiang Papers, and Xinjiang Attacks.
None of the attacks across regions used the same types of weapons or had similar target types.

Overall, it does not seem that we gained much insight from this analysis. We are no closer to understanding why these outliers exist, or how to prevent them.

Data Processing II

After the disappointing first act, we will attempt to extract information about the motives of perpetrators in the Global Terrorism Database. The GTD does not have a categorical variable for motivation, but instead has an unstructured text variable, motive, that we will attempt to analyze. Let's begin by cleaning the motive variable and performing some exploratory data analysis. To clean the variable, we will drop all rows where motive is missing. Additionally, we will alter the rows so they contain no punctuation and are all in lowercase.

We start by dropping any incidents with no motive coded. We started with 201,183 incidents and find that 53,629 have a motive coded. Additionally, we removed punctuation from the column and made each entry lowercase.
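The cleaning steps just described can be sketched with pandas string methods. The helper name `clean_motive` is hypothetical and the sample rows are made up; the approach (drop missing values, lowercase, strip ASCII punctuation) is one reasonable reading of the description, not necessarily the exact notebook code.

```python
import string
import pandas as pd

def clean_motive(df, column="motive"):
    """Drop rows with no motive text, then lowercase it and strip punctuation."""
    out = df.dropna(subset=[column]).copy()
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    out[column] = out[column].str.lower().str.translate(table).str.strip()
    return out

# Made-up rows (illustrative only); None marks an incident with no motive coded.
sample = pd.DataFrame({
    "iyear": [1998, 1999, 2000],
    "motive": [
        "The specific motive for the attack is unknown.",
        None,
        "To protest an upcoming election!",
    ],
})

cleaned = clean_motive(sample)
print(len(sample), "->", len(cleaned))
print(cleaned["motive"].tolist())
```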

Exploratory Data Analysis and Visualization II

Before we attempt any natural language processing on our column, let's use a word cloud to visualize which terms appear most frequently in motive.

Immediately, we see that there are a lot of frequent terms that aren't useful to us, such as "unknown", "specific", and "motive". It seems we will need to process the data further and remove words from the corpus that are not insightful. We will use the gensim and nltk libraries to further process the data by parsing each entry into a list of words and removing stop words (terms that we deem irrelevant) from the corpus. Finally, we visualize the parsed corpus again, without stop words, as a word cloud.
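To make the stop-word step concrete, here is a dependency-free sketch. It deliberately swaps in a short hand-written stop list and whitespace tokenization in place of the gensim and nltk resources used in the analysis, so the word lists, the `tokenize` helper, and the sample sentences are all assumptions for illustration.

```python
from collections import Counter

# Hand-written stop list for illustration only; the analysis itself relies on
# the stop-word resources in gensim and nltk.
STOP_WORDS = {"the", "for", "is", "a", "an", "of", "to", "in", "and", "was", "that"}
# Corpus-specific terms assumed to be uninformative in the motive field.
CUSTOM_STOP_WORDS = {"unknown", "specific", "motive", "attack"}

def tokenize(text, stop_words=STOP_WORDS | CUSTOM_STOP_WORDS):
    """Split cleaned text on whitespace and drop stop words."""
    return [tok for tok in text.split() if tok not in stop_words]

# Made-up, already cleaned motive strings (lowercase, no punctuation).
corpus = [
    "the specific motive for the attack is unknown",
    "the attack was intended to disrupt scheduled elections",
]

tokens = [tokenize(doc) for doc in corpus]
counts = Counter(tok for doc in tokens for tok in doc)
print(tokens)
print(counts.most_common(3))
```

Entries that consist only of stop words reduce to empty lists, so it may be worth dropping them before building a word cloud or fitting a topic model.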

The word cloud above is vastly different from the one we started with, but it is still not very informative. The largest terms often refer to groups or locations rather than an attack's ideological motivation. For example, "Islamic State" and "ISIL" both refer to a group name, while "Iraq" and "Levant" refer to places, but possibly also to the group ISIS. We are starting to see more informative motivations such as "Sectarian Pakistan", "Destabilize Algeria", and "Elections scheduled".

Machine Learning & Hypothesis Testing Part II

We will now use topic modeling to attempt to extract common topics, or in this case generalized motivations, from our corpus. Topic modeling is an NLP technique for finding hidden semantic structures in documents. It is a form of unsupervised learning that attempts to cluster similar groups of documents together. Specifically, we will be using Latent Dirichlet Allocation (LDA). LDA assumes that each document is a combination of a fixed number of topics. And that each topic has a

Insights Attained II